Lightweight Fault-tolerance for Highly Cooperative Distributed Applications

Authors

  • Lorenzo Alvisi
  • Sriram Rao
  • Harrick M. Vin
Abstract

The recent introduction of high-speed networks and faster processors, together with the rapid growth of heterogeneous large-scale distributed systems, has enabled the development of distributed applications that move beyond the client-server model to truly harness the computational potential of distributed systems. These new applications will be structured around groups of agents that communicate using messages as well as files. Some of these emerging applications will be critical enough to life or business to warrant explicit process replication to achieve high availability. Often, however, explicit replication will be too costly to implement, or high availability will simply not be necessary. In these circumstances, the availability of low-overhead fault-tolerance techniques will be crucial to achieving reliability. To address these needs, we are developing lightweight fault-tolerance (LFT), a new low-overhead approach to fault-tolerance for highly cooperative distributed applications. In the first part of this paper, we describe how LFT extends the causal logging techniques used in message passing to file communication. We show how, in our approach, all the synchronous operations that log-based protocols currently perform during file I/O are either eliminated or made asynchronous, thereby removing the opportunities for blocking. Furthermore, we argue that our approach has the potential to enhance the effectiveness of existing rollback recovery techniques for software fault-tolerance. In the second part of the paper, we validate LFT through extensive simulation. Our results indicate that LFT brings the cost of file communication down to the level of message passing, drastically reducing the overhead incurred by fault-tolerant applications in performing file I/O.

Similar resources

Lightweight Message Logging Protocol for Distributed Sensor Networks

Among a lot of rollback-recovery protocols developed for providing fault-tolerance for long-running distributed applications, sender-based message logging with checkpointing is one of the most lightweight fault-tolerance techniques to be capable of being applied in this field, significantly decreasing high failure-free overhead of synchronous logging by using message sender's volatile memory as...

Application Aware for Byzantine Fault Tolerance

Driven by the need for higher reliability of many distributed systems, various replication-based fault tolerance technologies have been widely studied. A prominent technology is Byzantine fault tolerance (BFT). BFT can help achieve high availability and trustworthiness by ensuring replica consistency despite the presence of hardware failures and malicious faults on a small portion of the replic...

Improving the palbimm scheduling algorithm for fault tolerance in cloud computing

Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...

On Applications of Cooperative Security in Distributed Networks

Many applications running on the Internet operate in fully or semi-distributed fashion including P2P networks or social networks. Distributed applications exhibit many advantages over classical client-server models regarding scalability, fault tolerance, and cost. Unfortunately, the distributed system operation also brings many security threats along that challenge their performance and reliabi...

Towards Middleware for Fault-Tolerance in Distributed Real-Time and Embedded Systems

Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties require the middleware to ...

Publication date: 2007